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Abstract 

Sense tagging, the automatic assignment of 
the appropriate sense from some lexicon to 
each of the words in a text, is a specialised 
instance of the general problem of seman- 
tic tagging by category or type. We discuss 
which recent word sense disambiguation al- 
gorithms are appropriate for sense tagging. 
It is our belief that sense tagging can be 
carried out effectively by combining several 
simple, independent, methods and we in- 
clude the design of such a tagger. A proto- 
type of this system has been implemented, 
correctly tagging 86% of polysemous word 
tokens in a small test set, providing evi- 
dence that our hypothesis is correct. 

1 Sense tagging 

This workshop is about semantic tagging: marking 
each word tokenQ in a text with some marker identi- 
fying its semantic category, similar to the way a part- 
of-speech tagger assigns a grammatical category to 
each token in a text. Our recent work has been con- 
cerned with sense tagging, a particular instance of 
this problem. Sense tagging is the process of assign- 
ing, to each content word in a text, its particular 
sense from some lexicon. This differs from the more 
general case of semantic tagging, where the tags for 
each word (type) are not be specific to that type and 
do not correspond to word senses in a lexicon. For 
example the tags may be broad semantic categories 
such as HUMAN or ANIMATE or WordNet synsets. 

Another, broader, class of algorithms are word 
sense disambiguation (WSD) algorithms. By WSD 
algorithm we mean any procedure which carries out 
semantic disambiguation on words, these may not 
necessarily be tagging algorithms, in that they do 



not attempt to mark every token in a text but may 
be restricted to disambiguating small sets of word 
types. 

Sense tagging is a difficult problem: each word 
(type) has its own set of tags which may be quite 
large. This rules out approaches which rely on a dis- 
criminator being created for each semantic tag which 
is then applied to text, although this is a valuable 
technique when there are a small number of tags 
which are broad semantic categories. 

However, sense tagging is an extremely useful pro- 
cedure to carry out since the tags which are associ- 
ated during sense tagging are rich in knowledge and 
therefore likely to be extremely useful for further 
processing. Indeed, the lack of reliable, large-scale, 
sense taggers has often been blamed for the failure 
of machine translation for the last 30 years. 

In this paper we shall discuss some recent ap- 
proaches to the WSD problem and examine their 
usefulness for the more specialised task of sense tag- 
ging. We then propose an approach which makes 
use of several different types of information and re- 
port a partial implementation of this system which 
produces very encouraging results. 

2 Recent Word Sense 

Disambiguation algorithms 

Recent word sense disambiguation (WSD) algo- 
rithms can be categorised into two broad types: 

1. WSD using information in an explicit lexicon. 
This is usually a Machine Readable Dictio- 
nary (MRD) such as the Longman Dictionary 



of Contemporary English (LDOCE) (Procter 
"19781 ), WordNet ( [Miller (Ed.), 1990| ) 



or hand- 
crafted. Recent examples of this work include 



flBruce and Guthrie , 199% (|B ruce and Wiebe 



1994), (McRoy, 1992) 



1 Often loosened to each content word in a text. 



2. WSD using information gained from training on 



some corpus. This approach can be further sub- 
classified: 

(a) Supervised training, where information is 
gathered from corpora which have already 
been semantically disambiguated. As such 
corpora are hard to obtain, usually re- 
quiring expensive hand-tagging, research in 
this area has concentrated on other forms 
of lexical ambiguities, eg. (Gale, Church 



and Yarowsky, 1992) 



(b) Unsupervised training, where information 
is gathered from raw corpora which has 
not been semantically disambiguated. The 
best examples of this approac h has been 
the r es ent work of Yaro w sky - ( Yarowsky. 



19921) , (|Yarowsky, 199J ), QYarowsky, 1995| ) 



These approaches are not mutually exclusive and 
there are, of course, some hybrid cases, for example 



Luk ( Luk, 1995 ) uses information in MRD defini- 
tions (approach |l|) and statistical information from 
untagged corpora (approach pb| ) . 

3 Comparing Different Approaches 

Approach ^a| is the least promising since text tagged 
with word senses is practically non-existent and is 
both time consuming and difficult to produce manu- 
ally. Much of the research in this area has been com- 
promised by the fact that researchers have focussed 
on lexical ambiguities that are not true word sense 
distinctions, such as words translated differently 
across two languages ( Gale, Church, and Yarowsky] 
1992] ) or homophonespl ( |Yarowsky, 1993| ). 

Even in the cases where data with the appropriate 
sense distinctions is available, the text is unlikely to 
be from the desired domain: a word sense discrim- 
inator trained on company news text will be much 
less effective on text about electronics products. A 
discriminator trained on many types of text so as to 
be generic will not be particularly successful in any 
specific domain. 

Approach [2b| has received much attention recently. 
Its disadvantage is that sense disambiguation is not 
carried out relative to any well defined set of senses, 
but rather an ad hoc set. Although this research 
has been the most successful of all approaches, it is 
difficult to see what use could be made of the word 
sense distinctions produced. 

Using approach [l] with hand crafted lexicons has 
the disadvantage of being expensive to create: in- 
deed Small and Rieger ( [Small and Rieger, 1982 ) 
attempted WSD using "word experts" , which were 

2 Words pronounced identically but spelled differently. 



essentially hand crafted disambiguators. They re- 
ported that the word expert for "throw" is "cur- 
rently six pages long, but should be ten times that 
size" , making this approach impractical for any sys- 
tem aiming for broad coverage. 

4 Proposed Approach 

Word senses are not absolute or Platonic but defined 
by a given lexicon, as has been known for many years 
from early work on WSD, even though the contrary 
seems widely believed: ".. it is very difficult to assign 
word occurrences to sense classes in any manner that 
is both general and determinate. In the sentences "I 
have a stake in this country." and "My stake in the 
last race was a pound" is "stake" being used in the 
same sense or not? If "stake" can be interpreted 
to mean something as vague as 'Stake as any kind 
of investment in any enterprise' then the answer is 
yes. So, if a semantic dictionary contained only two 
senses for "stake": that vague sense together with 
'Stake as a post', then one would expect to assign the 
vague sense for both the sentences above. But if, on 
the other hand, the dictionary distinguished 'Stake 
as an investment' and 'Stake as an initial payment in 
a game or race' then the answer would be expected 
to be different. So, then, word sense disambiguation 
is relative to the dictionary of sense choices available 



and c an have no absolute quality about it." flWilks 



1972| ) 



There is no general agreement over the number of 
senses appropriate for lexical entries: at one end of 



the spectrum Wierzbicka (Wierzbicka, 1989) claims 
words have essentially one sense while Pustejovsky 
believes that "... words can assume a potentially 
infinite number of senses in context." (Pustejovsky 



1995) How, then, are we to get an initial lexicon of 
word senses? We believe the best resource is still 
a Machine Readable Dictionary: they have a rela- 
tively well-defined set of sense tags for each word 
and lexical coverage is high. 

MRDs are, of course, normally generic, and much 
practical WSD work is for sub-domains. We are ad- 
hering to the view that it is better to start with such 
a generic lexicon and adapt it automatically with 
specialist words and senses. The work described here 



is part of ECRAN flWilks, 1995IJ , a European LRE 
project on tuning lexicons to domains, with a gen- 
eral sense tagging module used as a first stage. 

5 Knowledge Sources 

An interesting fact about recent word sense disam- 
biguation algorithms is that they have made use of 
different, orthogonal, sources of information: the in- 



formation provided by each source seems indepen- 
dent of and has no bearing on any of the others. We 
propose a tagger that makes use of several types of 
information (dictionary definitions, parts-of-speech, 
domain codes, selectional preferences and collocates) 
in the tradition of McRoy flMcRoy, 199^ ) although, 
the information sources we use are orthogonal, un- 
like the sources she used, making it easier to evaluate 
the performance of the various modules. 

5.1 Part-of-speech 

It has already been shown that part-of-speech tags 
are a useful discriminator for se mantic disambigua- 
tion ( Wjlks and Stevenson, 199€ ) , although they are 
not, normally, enough to fully disambiguate a text. 
For example knowing " bank" in " My bank is on the 
corner." is being used as a noun will tell us that the 
word is not being used in the 'plane turning cor- 
ner' sense but not whether it is being used in the 
'financial institution' or 'edge of river' senses. Part- 
of-speech tags can provide a valuable step towards 
the solution to sense tagging: fully disambiguating 
about 87% of ambiguous word tokens and reducing 
the ambiguity for some of the rest. 

5.2 Domain codes (Thesaural categories) 

Pragmatic domain codes can be used to disam- 
biguate (usually nominal) senses, as was shown by 
flBrucc and Guthrie, 19921 ) and ( |Yarowsky, 1992j ). 
Our intuition here is that disambiguation evidence 
can be gained by choosing senses which are closest in 
a thesaural hierarchy. Closeness in such a hierarchy 
can be effectively expressed as the number of nodes 
between concepts. We are implementing a simple 
algorithm which prefers close senses in our domain 
hierarchy which was derived from LDOCE (Bruce 
and Guthrie, 1992|). 



we hope to be reasonably confident of the senses of 
nouns in the text from processes such as 5.2 and [Tq. 



5.3 Collocates 

Recent work has been done using collocations as 



sema ntic disambiguators, (Yarowsky, 1993), (Dorr 



I99q ), particularly for verbs. We are attempting to 



derive disambiguation information by examining the 
prepositions as given in the subcategorization frames 
of verbs, and in the example sentences in LDOCE. 

5.4 Selectional Preferences 

There has been a long tradition in NLP of u sing se- 
lectional preferences for WSD flWilks 1972]) . This 
appr oach has been recently u sed by ( McRoy, 1992 ) 
and ( Mahesh and Beale, 1996 ). At its best it disam- 
biguates both verbs, adjectives and the nouns they 
modify at the same time, but we shall use this in- 
formation late in the disambiguation process when 



5.5 Dictionary definitions 



Lesk ( Lesk, 1986 ) proposed a method for seman- 
tic disambiguation using the dictionary definitions 
of words as a measure of their semantic closeness 
and proposed the disambiguation of sentences by 
computing the overlap of definitions for a sentence. 
Simmulated annealing, a numerical optimisation al- 
gorithm, was used to make this process practical 
( Cowie, Guthrie, and Guthrie, 1992 ), choosing an 
assignment of senses from as many as 10 10 choices. 

The optimisation is carried out by minimising an 
evaluation function, computed from the overlap of 
a given configuration of senses. The overlap is the 
total number of times each word appears more than 
once in the dictionary definitions of all the senses 
in the configuration. So that if the word "bank" 
appeared three times in a given configuration we 
would add two to the overlap total. This function 
has the disadvantage that longer definitions are pref- 
ered over short ones, since these simply have more 
words which can contribute to the overlap. Thus 
short definitions or definitions by synonym are pe- 
nalised. 

We attempted to solve this problem by making 
a slight change to the method for calculating the 
overlap. Instead of each word contributing one we 
normalise it's contribution by the number of words 
in the definition it came from, so if a word came from 
a definition with three words it would add one third 
to the overlap total. In this way long definitions have 
to have many words contributing to the total to be 
influential and short definitions are not penalised. 

We found that this new function lead to a small 
improvement in the results of the disambiguation, 
however we do not believe this to be statistically 
significant. 



6 A Basic Tagger 

We have recently implemented a basic version of 
this tag ger, initially incorporating onl y th e part-of- 
speech (5.1) and dictionary definition (5.5) stages in 
the process, with further stages to be added later. 
Our tagger currently consists of three modules: 



• Dictionary look-up module 

• Part-of-speech filter 

• Simulated annealing 

1. We have chosen to use the machine readable 
version of LDOCE as our lexicon. This has been 



used extensively in NLP research and provides 
a broad set of senses for sense tagging. 

The text is initially stemmed, leaving only mor- 
phological roots, and split into sentences. Then 
words belonging to a list of stop words (prepo- 
sitions, pronouns etc.) are removed. For each 
of the remaining words, each of its senses are 
extracted from LDOCE and stored with that 
word. The textual definitions in each sense is 
processed to remove stop words and stem re- 
maining words. 



2. The text is tagged using the Brill tagger ( [Brill 



1992 ) and a translation is carried out using a 



manually defined mapping from the syntactic 
t ags assigned by Brill (Penn Tree Bank tag s 



( Marcus, Santorini, and Marcinkiewicz, 199 j )) 
onto the simpler part-of-speech categories asso- 
ciated with LDOCE senses. We then remove 
all senses whose part-of-speech is not consistent 
with the one assigned by the tagger, if none of 
the senses are consistent with the part-of-speech 
we assume the tagger has made an error and do 
not remove any senses. 



3. The final stage is to use the simulated anneal- 
ing algorithm to optimise the dictionary defi- 
nition overlap for the remaining senses. This 
algorithm assigns a single sense to each token 
which is the tag associated with that token. 

7 Example Output 

Below is an example of the senses assigned by the 
system for the sentence "A rapid rise in prices soon 
eventuated unemployment." We show the homo- 
graph and sense numbers from LDOCE with the 
stemmed content words from the dictionary defini- 
tions which are used to calculate the overlap follow- 
ing the dash. 

• rapid homograph 1 sense 2 - done short time 

• rise homograph 2 sense 1 - act grow greater 
powerful 

• soon homograph sense 1 - long short time 

• prices homograph 1 sense 1 - amount money 
which thing be offer sell buy 

• unemployment homograph sense 1 - condition 
lack job 

The senses have additional information associated 
which we do not show here: domain codes, part of 



speech and grammatical information as well as se- 
mantic information. 

The senses for a word in LDOCE are grouped 
into homographs, sets of senses realeated by mean- 
ing. For example, one of the homographs of "bank" 
means roughly 'things piled up', the different senses 
distinguishing exactly what is piled up. 

8 Results 

We have conducted some preliminary testing of 
this approach: our tests were run on 10 hand- 
disambiguated sentences from the Wall Street Jour- 
nal amounting to a 209 word corpus. We found 
that of, the word tokens which had more than 1 
homograph, 86% were assigned the correct homo- 
graph and 57% of tokens were assigned the cor- 
rect sense using our simple tagger. These figures 
should be compared to 72% correct homograph as- 
signment and 47% correct sense assignment using 
simulated annealing alone on the same test set (see 



(Cowie, Guthrie, and Guthrie, 1992)). It should be 



noted that the granularity of sense distinctions at 
the LDOCE homograph level (eg. "bank" as 'edge 
of river' or 'financial institution') is the same as the 
distinctions m ade by current small-scale WSD algo - 
rithms ( eg. (Gale, Church, and Yarowsky, 1992 ), 



( Yarowsky, 1993 ), ( Schutze, 1992 )) and our system 
is a true tagging algorithm, operating on free text. 

Our evaluation is unsatisfactory due to the small 
test set, but does demonstrate that the use of inde- 
pendent knowledge sources leads to an improvement 
in the quality of disambiguation. We fully expect 
our results to improve with the addition of further, 
independent, modules. 



9 Conclusion 

In this paper we have argued that semantic tagging 
can be carried out only relative to the senses in some 
lexicon and that a machine readable dictionary pro- 
vides an appropriate set of senses. 

We reported a simple semantic tagger which 
achieves 86% correct disambiguation using two inde- 
pendent sources of information: part-of-speech tags 
and dictionary definition overlap. A proposal to ex- 
tend this tagger is developed, based on other, mutu- 
ally independent, sources of lexical information. 
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